Part III: Machine Learning¶

1. Preprocessing¶

Import packages and load data¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm # colormaps
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering
import seaborn as sns
In [2]:
business_file = '../project_data/yelp_business.csv'
review_file = '../project_data/yelp_review.csv'


# Load the business data and review data
business_df = pd.read_csv(business_file)
review_df = pd.read_csv(review_file)

Handle Missing Values¶

Dropping Missing Values in the Data:

  • Rows in business_df with missing values in the categories column were dropped.
  • Rows in review_df with missing values in the text column were dropped.
  • A 1% random sample of the remaining reviews (random_state=42) was taken to keep runtimes manageable.
  • This step ensures that all businesses have valid category information, which is essential for filtering later.
In [3]:
business_df = business_df.dropna(subset=['categories'])
review_df = review_df.dropna(subset=['text'])
review_df = review_df.sample(frac=0.01, random_state=42)
print(business_df.isnull().sum())
print(review_df.isnull().sum())
print('After dropping NA: business data :', business_df.shape)
print('After dropping NA: review data :', review_df.shape)
business_id         0
name                0
address          5126
city                0
state               0
postal_code        73
latitude            0
longitude           0
stars               0
review_count        0
is_open             0
attributes      13642
categories          0
hours           23120
dtype: int64
review_id      0
user_id        0
business_id    0
stars          0
useful         0
funny          0
cool           0
text           0
date           0
dtype: int64
After dropping NA: business data : (150243, 14)
After dropping NA: review data : (69903, 9)

Filtering for Restaurants:

  • Filtered businesses with "Restaurants" in their categories column into restaurant_df.
  • Retained reviews in review_df whose business_id matched restaurant_df.
  • This ensures the dataset focuses only on restaurants.
In [4]:
# Filter out businesses that are not restaurants
restaurant_df = business_df[business_df['categories'].str.contains('Restaurants')]
restaurant_ids = restaurant_df['business_id'].values

# Filter out reviews that are not for restaurants
restaurant_review_df = review_df[review_df['business_id'].isin(restaurant_ids)]

print('Number of reviews for restaurants:', len(restaurant_review_df))
print(review_df.head())
print(business_df.head())
Number of reviews for restaurants: 47195
                      review_id                 user_id  \
1295256  J5Q1gH4ACCj6CtQG7Yom7g  56gL9KEJNHiSDUoyjk2o3Q   
3297618  HlXP79ecTquSVXmjM10QxQ  bAt9OUFX9ZRgGLCXG22UmA   
1217795  JBBULrjyGx6vHto2osk_CQ  NRHPcLq2vGWqgqwVugSgnQ   
3730348  U9-43s8YUl6GWBFCpxUGEw  PAxc0qpqt5c2kA0rjDFFAg   
1826590  8T8EGa_4Cj12M6w8vRgUsQ  BqPR1Dp5Rb_QYs9_fz9RiA   

                    business_id  stars  useful  funny  cool  \
1295256  8yR12PNSMo6FBYx1u5KPlw    2.0       1      0     0   
3297618  pBNucviUkNsiqhJv5IFpjg    5.0       0      0     0   
1217795  8sf9kv6O4GgEb0j1o22N1g    5.0       0      0     0   
3730348  XwepyB7KjJ-XGJf0vKc6Vg    4.0       0      0     0   
1826590  prm5wvpp0OHJBlrvTj9uOg    5.0       0      0     0   

                                                      text  \
1295256  Went for lunch and found that my burger was me...   
3297618  I needed a new tires for my wife's car. They h...   
1217795  Jim Woltman who works at Goleta Honda is 5 sta...   
3730348  Been here a few times to get some shrimp.  The...   
1826590  This is one fantastic place to eat whether you...   

                        date  
1295256  2018-04-04 21:09:53  
3297618  2020-05-24 12:22:14  
1217795  2019-02-14 03:47:48  
3730348  2013-04-27 01:55:49  
1826590  2019-05-15 18:29:25  
              business_id                      name  \
0  Pns2l4eNsfO8kk83dixA6A  Abby Rappoport, LAC, CMQ   
1  mpf3x-BjTdTEA3yCZrAYPw             The UPS Store   
2  tUFrWirKiKi_TAnsVWINQQ                    Target   
3  MTSW4McQd7CbVtyjqoe9mw        St Honore Pastries   
4  mWMc6_wTdE0EUBKIGXDVfA  Perkiomen Valley Brewery   

                           address           city state postal_code  \
0           1616 Chapala St, Ste 2  Santa Barbara    CA       93101   
1  87 Grasso Plaza Shopping Center         Affton    MO       63123   
2             5255 E Broadway Blvd         Tucson    AZ       85711   
3                      935 Race St   Philadelphia    PA       19107   
4                    101 Walnut St     Green Lane    PA       18054   

    latitude   longitude  stars  review_count  is_open  \
0  34.426679 -119.711197    5.0             7        0   
1  38.551126  -90.335695    3.0            15        1   
2  32.223236 -110.880452    3.5            22        0   
3  39.955505  -75.155564    4.0            80        1   
4  40.338183  -75.471659    4.5            13        1   

                                          attributes  \
0                      {'ByAppointmentOnly': 'True'}   
1             {'BusinessAcceptsCreditCards': 'True'}   
2  {'BikeParking': 'True', 'BusinessAcceptsCredit...   
3  {'RestaurantsDelivery': 'False', 'OutdoorSeati...   
4  {'BusinessAcceptsCreditCards': 'True', 'Wheelc...   

                                          categories  \
0  Doctors, Traditional Chinese Medicine, Naturop...   
1  Shipping Centers, Local Services, Notaries, Ma...   
2  Department Stores, Shopping, Fashion, Home & G...   
3  Restaurants, Food, Bubble Tea, Coffee & Tea, B...   
4                          Brewpubs, Breweries, Food   

                                               hours  
0                                                NaN  
1  {'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ...  
2  {'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ...  
3  {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ...  
4  {'Wednesday': '14:0-22:0', 'Thursday': '16:0-2...  

Feature Engineering for Reviews:

  • Converted the date column from string to a pandas datetime object.
  • Extracted the year, month, day, and hour of each review date.
  • Computed the character length of each review text.
In [5]:
# Feature engineering for the review df
# Work on an explicit copy to avoid SettingWithCopyWarning
restaurant_review_df = restaurant_review_df.copy()
# Convert date to datetime format
restaurant_review_df['date'] = pd.to_datetime(
    restaurant_review_df['date'], errors='coerce')
# Extract year, month, day, and hour from the date
restaurant_review_df['year'] = restaurant_review_df['date'].dt.year
restaurant_review_df['month'] = restaurant_review_df['date'].dt.month
restaurant_review_df['day'] = restaurant_review_df['date'].dt.day
restaurant_review_df['review_hr'] = restaurant_review_df['date'].dt.hour
restaurant_review_df['review_length'] = \
    restaurant_review_df['text'].apply(len)
restaurant_review_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 47195 entries, 1295256 to 4428200
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   review_id      47195 non-null  object        
 1   user_id        47195 non-null  object        
 2   business_id    47195 non-null  object        
 3   stars          47195 non-null  float64       
 4   useful         47195 non-null  int64         
 5   funny          47195 non-null  int64         
 6   cool           47195 non-null  int64         
 7   text           47195 non-null  object        
 8   date           47195 non-null  datetime64[ns]
 9   year           47195 non-null  int32         
 10  month          47195 non-null  int32         
 11  day            47195 non-null  int32         
 12  review_hr      47195 non-null  int32         
 13  review_length  47195 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int32(4), int64(4), object(4)
memory usage: 4.7+ MB

Feature Engineering for Businesses:

  • Extracted the average daily opening hours from the hours dictionary.
In [6]:
# Feature engineering for the business df
import ast

def dict_to_avg_hrs(d):
    '''
    Convert the stringified hours dictionary to average hours per day.
    '''
    def str_to_min(s):  # convert an 'H:M' time string to minutes
        hh, mm = s.split(':')
        return int(hh) * 60 + int(mm)
    if pd.isnull(d):
        return np.nan
    d = ast.literal_eval(d)  # safely parse the string into a dict
    hr_dict = {}
    for day, d_info in d.items():
        start, end = d_info.split('-')  # start and end time strings
        total_hrs = (str_to_min(end) - str_to_min(start)) / 60
        if total_hrs < 0:  # the end time falls on the next day
            total_hrs += 24
        hr_dict[day] = total_hrs
    # average hours per day
    return sum(hr_dict.values()) / len(hr_dict)

# Acquire restaurants' average opening hours
# (work on an explicit copy to avoid SettingWithCopyWarning)
restaurant_df = restaurant_df.copy()
restaurant_df['avg_opening_hrs'] = \
    restaurant_df['hours'].apply(dict_to_avg_hrs)
restaurant_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 52268 entries, 3 to 150340
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   business_id      52268 non-null  object 
 1   name             52268 non-null  object 
 2   address          51825 non-null  object 
 3   city             52268 non-null  object 
 4   state            52268 non-null  object 
 5   postal_code      52247 non-null  object 
 6   latitude         52268 non-null  float64
 7   longitude        52268 non-null  float64
 8   stars            52268 non-null  float64
 9   review_count     52268 non-null  int64  
 10  is_open          52268 non-null  int64  
 11  attributes       51703 non-null  object 
 12  categories       52268 non-null  object 
 13  hours            44990 non-null  object 
 14  avg_opening_hrs  44990 non-null  float64
dtypes: float64(4), int64(2), object(9)
memory usage: 6.4+ MB

Merging Reviews and Businesses:

  • Merged restaurant_review_df and restaurant_df on business_id using an inner join.
  • Renamed stars_x to review_stars and stars_y to business_stars for clarity.
  • Dropped rows where avg_opening_hrs is null to avoid errors in later clustering/analysis.
In [7]:
merged_df = pd.merge(restaurant_review_df, restaurant_df, on='business_id', how='inner')
merged_df = merged_df.rename(columns={'stars_x': 'review_stars', 'stars_y': 'business_stars'})
#drop null avg_open_hrs to avoid errors in clustering/analysis
merged_df = merged_df[merged_df['avg_opening_hrs'].notnull()]
merged_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 45514 entries, 0 to 47193
Data columns (total 28 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   review_id        45514 non-null  object        
 1   user_id          45514 non-null  object        
 2   business_id      45514 non-null  object        
 3   review_stars     45514 non-null  float64       
 4   useful           45514 non-null  int64         
 5   funny            45514 non-null  int64         
 6   cool             45514 non-null  int64         
 7   text             45514 non-null  object        
 8   date             45514 non-null  datetime64[ns]
 9   year             45514 non-null  int32         
 10  month            45514 non-null  int32         
 11  day              45514 non-null  int32         
 12  review_hr        45514 non-null  int32         
 13  review_length    45514 non-null  int64         
 14  name             45514 non-null  object        
 15  address          45460 non-null  object        
 16  city             45514 non-null  object        
 17  state            45514 non-null  object        
 18  postal_code      45512 non-null  object        
 19  latitude         45514 non-null  float64       
 20  longitude        45514 non-null  float64       
 21  business_stars   45514 non-null  float64       
 22  review_count     45514 non-null  int64         
 23  is_open          45514 non-null  int64         
 24  attributes       45484 non-null  object        
 25  categories       45514 non-null  object        
 26  hours            45514 non-null  object        
 27  avg_opening_hrs  45514 non-null  float64       
dtypes: datetime64[ns](1), float64(5), int32(4), int64(6), object(12)
memory usage: 9.4+ MB

Identifying Major Cuisine Types:

  • Extracted major_cuisine from categories based on predefined types (Italian, Chinese, Mexican, etc.).
  • Retained only rows with identified major_cuisine in the filtered dataset.
  • Focuses the analysis on specific cuisines of interest.
In [8]:
# Define cuisine types of interest
cuisine_types = ["Italian", "Chinese", "Mexican", "Japanese", "American", "Indian"]

# Function to determine the major cuisine type
def get_major_cuisine(categories):
    for cuisine in cuisine_types:
        if cuisine.lower() in categories.lower():
            return cuisine
    return None  # Return None if no major cuisine is found

# Apply the function to replace categories with the major cuisine type
merged_df['major_cuisine'] = merged_df['categories'].apply(get_major_cuisine)

# Filter to include only rows where a major cuisine was identified
filtered_df = merged_df[merged_df['major_cuisine'].notna()]

# Display the updated DataFrame
print(filtered_df[['categories', 'major_cuisine']].head())

# Check the size of the filtered dataset
print(f"Number of entries with identified major cuisine types: {filtered_df.shape[0]}")
                                          categories major_cuisine
3  American (New), Bars, Sports Bars, Restaurants...      American
5                        American (New), Restaurants      American
7  Food, Sandwiches, American (Traditional), Rest...      American
8              Nightlife, Bars, Mexican, Restaurants       Mexican
9                               Italian, Restaurants       Italian
Number of entries with identified major cuisine types: 28142

Feature Scaling (Numeric)¶

In [9]:
merged_df.select_dtypes(include=['number']).columns
Out[9]:
Index(['review_stars', 'useful', 'funny', 'cool', 'year', 'month', 'day',
       'review_hr', 'review_length', 'latitude', 'longitude', 'business_stars',
       'review_count', 'is_open', 'avg_opening_hrs'],
      dtype='object')
  • Scaled review_count, business_stars, latitude, longitude, year, month, day, review_hr, review_length, and avg_opening_hrs using Z-score standardization.
  • Ensures each feature has a mean of 0 and a standard deviation of 1.
  • Prepares the dataset for clustering and PCA by placing all features on a comparable scale.
  • review_stars is excluded from the PCA data matrix because it is the variable of interest (the y label), so it enters neither PCA nor clustering.
In [10]:
from sklearn.preprocessing import StandardScaler

# Select numerical columns to scale
#numerical_features = ['review_count', 'review_stars', 'business_stars']
numerical_features = [
    'review_count','business_stars', 'latitude', 'longitude', 
    'year', 'month', 'day', 'review_hr', 'review_length', 
    'avg_opening_hrs']

# Standardization (Z-Score)
standard_scaler = StandardScaler()

# Work on a copy and cast the int columns to float first, to avoid
# SettingWithCopyWarning and dtype FutureWarnings on assignment
filtered_df = filtered_df.copy()
filtered_df[numerical_features] = \
    filtered_df[numerical_features].astype('float64')
filtered_df.loc[:, numerical_features] = \
    standard_scaler.fit_transform(filtered_df[numerical_features])

# Display the scaled dataset
print(filtered_df[numerical_features].head())
   review_count  business_stars  latitude  longitude      year     month  \
3     -0.290677       -2.175331  0.776803   0.953303 -1.806387  1.335569   
5      0.669448        1.224356 -1.592027   0.430804  1.480650  0.161579   
7     -0.311656       -1.325409  0.737862   0.943432 -0.491572 -0.718914   
8     -0.356084       -0.475487 -0.728070  -1.502917 -0.162868 -0.718914   
9     -0.574518       -0.475487  0.833902   0.945169 -0.491572  1.335569   

        day  review_hr  review_length  avg_opening_hrs  
3 -0.885121  -1.426214      -0.175910         1.849057  
5 -1.565276  -1.302427      -0.692089         0.613000  
7  0.701906  -1.302427       1.465075         0.458492  
8  0.135111   0.059223      -0.470595        -0.983575  
9 -0.091607  -1.178641       0.848743        -1.426495  

Feature Scaling (Categorical)¶

  • Encoded major_cuisine into binary columns using OneHotEncoder.
  • Used sparse_output=False to generate a dense array suitable for DataFrame conversion.
  • Concatenated the encoded features with filtered_df.
  • Dropped the original major_cuisine column after encoding.
  • Prepares the categorical data for numerical analysis and machine learning models.
  • We excluded the state variable because it would add many columns while having only a trivial impact on clustering performance, and it is weakly correlated with the other variables.
In [11]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Select categorical columns to encode
# categorical_features = ['major_cuisine', 'state']
categorical_features = ['major_cuisine']

# Create OneHotEncoder instance
one_hot_encoder = OneHotEncoder(sparse_output=False)  # sparse_output=False returns a dense array

# Fit and transform the categorical features
encoded_features = one_hot_encoder.fit_transform(filtered_df[categorical_features])

# Convert encoded features to a DataFrame
encoded_df = pd.DataFrame(encoded_features, columns=one_hot_encoder.get_feature_names_out(categorical_features))

# Concatenate the encoded features with the original DataFrame
filtered_df = pd.concat([filtered_df.reset_index(drop=True), encoded_df.reset_index(drop=True)], axis=1)

# Drop the original categorical column
filtered_df.drop(columns=categorical_features, inplace=True)

# Display the updated DataFrame
print(filtered_df.head())
                review_id                 user_id             business_id  \
0  18E_haOfOm8ks-A7SlVWRg  bnDZpsii_if2_wpn8oPcig  bK0j7YtVyN98UnM_8fUONg   
1  c7IQ5alG0pl9yCITtsIlrA  ZLKpeCqbCMWfNeT6yU8wUQ  zT2OzXDWKK1abapHs2RUrQ   
2  YHIicUo2zqA5zwe-lXhsNw  CEZMiWrgtF67m0GUm19ZJA  nKpWUL3kMt4cnNQhye2WqA   
3  3KFxmw4RG5E4ActnP8VPCQ  IOJnU62iJL1LM_X6A_p1xw  vtR2MjFToKkclbUX5DuhlQ   
4  tsCWBn7pc09M3jKiwi4w-g  2ULSyP0EK7LQaavU89efLA  eSJMA_VdUVQTDkRJiV9lHw   

   review_stars  useful  funny  cool  \
0           3.0       1      1     1   
1           5.0       1      0     0   
2           4.0       2      5     1   
3           1.0       6      1     0   
4           2.0       1      0     0   

                                                text                date  \
0  Dirt cheap happy hour specials.  Half priced d... 2011-11-08 01:30:27   
1  Philly cheese steak (loaded)  was phenomenal. ... 2021-07-02 02:17:40   
2  It's almost 10 o clock on a Tuesday and I am t... 2015-04-22 02:01:21   
3  Great ambience and seated quickly after arrivi... 2016-04-17 13:31:53   
4  Tried this restaurant for the first time tonig... 2015-11-15 03:17:19   

       year  ...                                         attributes  \
0 -1.806387  ...  {'NoiseLevel': "u'very_loud'", 'RestaurantsPri...   
1  1.480650  ...  {'RestaurantsReservations': 'False', 'Alcohol'...   
2 -0.491572  ...  {'BikeParking': 'True', 'GoodForMeal': "{'dess...   
3 -0.162868  ...  {'CoatCheck': 'False', 'NoiseLevel': "u'averag...   
4 -0.491572  ...  {'GoodForKids': 'True', 'Alcohol': "u'none'", ...   

                                          categories  \
0  American (New), Bars, Sports Bars, Restaurants...   
1                        American (New), Restaurants   
2  Food, Sandwiches, American (Traditional), Rest...   
3              Nightlife, Bars, Mexican, Restaurants   
4                               Italian, Restaurants   

                                               hours  avg_opening_hrs  \
0  {'Monday': '11:0-2:0', 'Tuesday': '11:0-2:0', ...         1.849057   
1  {'Monday': '10:0-21:0', 'Tuesday': '10:0-21:0'...         0.613000   
2  {'Tuesday': '12:0-22:0', 'Wednesday': '12:0-22...         0.458492   
3  {'Tuesday': '16:0-21:0', 'Wednesday': '16:0-21...        -0.983575   
4  {'Wednesday': '16:0-20:0', 'Thursday': '16:0-2...        -1.426495   

  major_cuisine_American major_cuisine_Chinese major_cuisine_Indian  \
0                    1.0                   0.0                  0.0   
1                    1.0                   0.0                  0.0   
2                    1.0                   0.0                  0.0   
3                    0.0                   0.0                  0.0   
4                    0.0                   0.0                  0.0   

  major_cuisine_Italian major_cuisine_Japanese  major_cuisine_Mexican  
0                   0.0                    0.0                    0.0  
1                   0.0                    0.0                    0.0  
2                   0.0                    0.0                    0.0  
3                   0.0                    0.0                    1.0  
4                   1.0                    0.0                    0.0  

[5 rows x 34 columns]

Dimensionality Reduction¶

  • Selected scaled numerical features processed earlier
  • Included one-hot encoded categorical features: columns starting with major_cuisine_.
  • Combined numerical and categorical features into a new DataFrame features_df.
  • Prepares the data for dimensionality reduction techniques like PCA.
In [ ]:
# Select one-hot encoded categorical features
categorical_features = [
    col for col in filtered_df.columns if col.startswith('major_cuisine_')
]

# Combine scaled numerical features and one-hot encoded categorical features
selected_features = numerical_features + categorical_features
features_df = filtered_df[selected_features]

# Display the prepared DataFrame
print("Prepared Features for Dimensionality Reduction:")
print(features_df.head())
print(f"Shape of features_df: {features_df.shape}")
Prepared Features for Dimensionality Reduction:
   review_count  business_stars  latitude  longitude      year     month  \
0     -0.290677       -2.175331  0.776803   0.953303 -1.806387  1.335569   
1      0.669448        1.224356 -1.592027   0.430804  1.480650  0.161579   
2     -0.311656       -1.325409  0.737862   0.943432 -0.491572 -0.718914   
3     -0.356084       -0.475487 -0.728070  -1.502917 -0.162868 -0.718914   
4     -0.574518       -0.475487  0.833902   0.945169 -0.491572  1.335569   

        day  review_hr  review_length  avg_opening_hrs  \
0 -0.885121  -1.426214      -0.175910         1.849057   
1 -1.565276  -1.302427      -0.692089         0.613000   
2  0.701906  -1.302427       1.465075         0.458492   
3  0.135111   0.059223      -0.470595        -0.983575   
4 -0.091607  -1.178641       0.848743        -1.426495   

   major_cuisine_American  major_cuisine_Chinese  major_cuisine_Indian  \
0                     1.0                    0.0                   0.0   
1                     1.0                    0.0                   0.0   
2                     1.0                    0.0                   0.0   
3                     0.0                    0.0                   0.0   
4                     0.0                    0.0                   0.0   

   major_cuisine_Italian  major_cuisine_Japanese  major_cuisine_Mexican  
0                    0.0                     0.0                    0.0  
1                    0.0                     0.0                    0.0  
2                    0.0                     0.0                    0.0  
3                    0.0                     0.0                    1.0  
4                    1.0                     0.0                    0.0  
Shape of features_df: (28142, 16)

Applying PCA and Visualizing Explained Variance:¶

  • Applied PCA to the prepared features_df to analyze the explained variance of each principal component.
  • Calculated the individual explained variance ratio and the cumulative explained variance.
  • Visualized the results:
    • Bar Plot: Shows the individual explained variance for each component.
    • Scatter Plot and Line: Represent the cumulative explained variance, helping determine the number of components needed to capture most of the variance.
  • Helps identify the optimal number of principal components for dimensionality reduction.
In [13]:
from sklearn.decomposition import PCA
import pandas as pd

# Apply PCA to determine explained variance
pca = PCA()  # Let PCA determine all components
pca_data = pca.fit_transform(features_df)

# Explained variance ratio and cumulative explained variance
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_explained_variance = np.cumsum(explained_variance_ratio)

# Plot the explained variance ratio and cumulative explained variance
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio, alpha=0.7, label='Individual Explained Variance')
plt.scatter(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, label='Cumulative Explained Variance', color='blue')
plt.plot(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, linestyle='--', color='blue')

# Add labels and title
plt.xlabel('Principal Component Index')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance by Principal Components')
plt.legend(loc='best')
plt.grid(True)
plt.show()
[Figure: Explained Variance by Principal Components]

Choosing the Number of Principal Components:¶

  • Based on the explained variance plot:
    • The first 9 components explain roughly 88% of the variance; adding the 10th brings the cumulative total to about 94%.
    • Beyond 10 components, the cumulative variance increases only minimally, indicating diminishing returns.
  • Decision:
    • Retain 10 components for dimensionality reduction to capture the majority of the information while reducing noise and complexity.
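As a side note, scikit-learn can also select the component count automatically from a target variance fraction by passing a float to n_components. A minimal sketch on synthetic data (the array X and its shape are made up for illustration, standing in for features_df):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))  # stand-in for the 16-column features_df

# A float n_components in (0, 1) keeps the smallest number of components
# whose cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.90)
pca.fit(X)
print(pca.n_components_)  # number of components retained
```

Here we fix n_components=10 instead so the downstream clustering has a stable feature count.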
In [26]:
n_components = 10
pca = PCA(n_components=n_components)

# Fit and transform the data
reduced_features = pca.fit_transform(features_df)

# Create a DataFrame for the reduced features
reduced_df = pd.DataFrame(reduced_features, 
    columns=[f'PC{i+1}' for i in range(n_components)])

# Display the resulting DataFrame
print("Reduced Features with 10 Principal Components:")
print(reduced_df.head())

# Explained variance ratio for the 10 components
explained_variance = pca.explained_variance_ratio_
cumulative_variance = explained_variance.cumsum()
pca_variance_df = pd.DataFrame({
    'Explained Variance Ratio': explained_variance,
    'Cumulative Explained Variance': cumulative_variance,
}, index=[f'PC{i+1}' for i in range(n_components)])
# print("Explained Variance Ratio:", explained_variance)
# print("Cumulative Explained Variance:", cumulative_variance)
pca_variance_df
Reduced Features with 10 Principal Components:
        PC1       PC2       PC3       PC4       PC5       PC6       PC7  \
0 -3.191052  0.447549  0.297477  1.102308 -1.413261 -1.246485  1.492744   
1  1.646259 -1.726046  0.265134 -0.171910 -1.519887 -0.930614  1.229623   
2 -1.766493  0.974424 -0.355674  0.038239 -0.577182  1.521661  0.985861   
3  0.487037 -0.443548 -1.057817  0.151415 -0.337888  0.427243 -1.251094   
4 -0.541648  1.648590 -0.809057  0.558286  0.635419 -0.586650  1.328383   

        PC8       PC9      PC10  
0 -0.423566 -0.047467 -0.203261  
1  0.994702  0.497637  0.545116  
2  0.549622  0.510694 -0.492943  
3 -0.336935 -1.142466 -0.437934  
4  1.163941 -0.538012 -0.687192  
Out[26]:
Explained Variance Ratio Cumulative Explained Variance
PC1 0.138477 0.138477
PC2 0.111295 0.249772
PC3 0.101976 0.351748
PC4 0.096944 0.448692
PC5 0.094133 0.542825
PC6 0.091539 0.634364
PC7 0.088010 0.722374
PC8 0.082884 0.805258
PC9 0.074648 0.879905
PC10 0.060540 0.940445
In [ ]:
plt.figure()
top_components = pd.DataFrame(pca.components_.T, 
                              index = features_df.columns)
#plot the principal components in terms of weights of raw variables
sns.heatmap(top_components, cmap='coolwarm', center=0)
plt.title('Top Components of PCA and associated Features')
plt.xlabel('ith PCA Component')
plt.ylabel('Raw Features')
Out[ ]:
Text(50.5815972222222, 0.5, 'Raw Features')
[Figure: heatmap of PCA component loadings over raw features]

Results with 10 Principal Components¶

Explained Variance Ratio

  • Definition: The proportion of the total variance in the data explained by each principal component.
    • PC1: Explains 13.85% of the variance.
    • PC2: Explains 11.12% of the variance.
    • PC3: Explains 10.20% of the variance.
    • PC4: Explains 9.69% of the variance.
    • The remaining explained variance ratios are listed in the first column of the table above.

Cumulative Explained Variance

  • Definition: The total variance explained when combining multiple components.
    • PC1 + PC2: Explain 24.97% of the variance.
    • PC1 + PC2 + PC3: Explain 35.17% of the variance.
    • PC1 + PC2 + PC3 + PC4: Explain 44.86% of the variance.

Interpretation of PCA Components

  • By visualizing the PCA components with respect to the features in the dataframe, we observe that cuisine does not provide discriminative information for capturing the variation in the data.
  • We also observe complementary attention patterns across different principal components, reflecting the orthogonality of the PCA components.
  • We interpret some of the salient components: PC1 (idx = 0) captures store-related information, PC2 (idx = 1) focuses on macro-review information, and PC3 (idx = 2) focuses on time information.
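One way to ground such interpretations is to inspect the largest-magnitude loadings per component. A hedged sketch on a synthetic frame (the column names are illustrative, not the notebook's actual features):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 6)),
                  columns=[f'feat_{i}' for i in range(6)])
pca = PCA(n_components=3).fit(df)

# Rows = raw features, columns = components (same layout as the heatmap)
loadings = pd.DataFrame(pca.components_.T, index=df.columns,
                        columns=['PC1', 'PC2', 'PC3'])
for pc in loadings.columns:
    top = loadings[pc].abs().nlargest(2).index.tolist()
    print(pc, 'is dominated by', top)
```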

Significance:

  • Retaining 10 principal components captures approximately 94% of the variance, meaning most of the information in the dataset is preserved.
  • This reduction simplifies the dataset while minimizing information loss.

Application:

  • The reduced features can now be used for clustering, classification, or other analyses, with a smaller, more manageable dataset.
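The information loss from this reduction can be checked directly with `inverse_transform`: reconstruction error shrinks as more components are kept. A self-contained sketch on synthetic data (not the Yelp features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))

# Project down to k components, reconstruct, and measure relative error
errs = {}
for k in (3, 6, 12):
    pca = PCA(n_components=k).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    errs[k] = np.mean((X - X_hat) ** 2) / np.mean(X ** 2)
    print(f"k={k:2d}: relative reconstruction error {errs[k]:.3f}")
```

With all 12 components retained, the reconstruction is exact up to floating-point error.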

2.Clustering and Analysis¶

We perform clustering because we are interested in how review ratings (review_stars) correlate with other variables. Exploring the hidden nuances among these variables better informs restaurants on ways to improve their review ratings. Clustering serves this purpose well and provides analysis of the habits of different customer groups.

K-Means Clustering¶

Intuition: We cluster the data using the informative components from PCA preprocessing. While the PCA plot suggests that 10 components give the best trade-off between data compression and information preservation, that many components might not benefit K-means, given the tendency of the lower-variance PCA components to contribute noise and encourage overfitting. Therefore, we experiment with different numbers of PCA components and check the resulting silhouette scores.
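The experiment described above can be condensed into a single sweep. This sketch uses synthetic blobs in place of `features_df` to keep it runnable:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in: 4 well-separated blobs in 10 dimensions
X_full, _ = make_blobs(n_samples=500, centers=4, n_features=10,
                       random_state=42)

# Sweep the number of retained components, score a fixed KMeans
for n_comp in (3, 5, 7):
    X = PCA(n_components=n_comp).fit_transform(X_full)
    labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
    print(f"{n_comp} PCA components -> silhouette "
          f"{silhouette_score(X, labels):.3f}")
```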

In [ ]:
def plot_kmeans_on_pca(X, range_n_clusters, vis = True):
    '''
    Plot the silhouette score and the KMeans clustering on PCA data
    (From Lecture)
    '''
    for n_clusters in range_n_clusters:
        # Initialize the clusterer with n_clusters value and a random generator
        # seed of 42 for reproducibility.
        clusterer = KMeans(n_clusters=n_clusters, n_init="auto", random_state=42)
        cluster_labels = clusterer.fit_predict(X)
        #features_df.loc[:, f'cluster_{n_clusters}'] = cluster_labels

        # The silhouette_score gives the average value for all the samples.
        # This gives a perspective into the density and separation of the formed
        # clusters
        silhouette_avg = silhouette_score(X, cluster_labels)
        print("For n_clusters =", n_clusters,
            "The average silhouette_score is :", silhouette_avg)
        
        if not vis:
            continue
        fig, (ax1, ax2) = plt.subplots(1, 2)
        fig.set_size_inches(18, 7)

        # The 1st subplot is the silhouette plot
        # The silhouette coefficient can range from -1, 1 but in this example all
        # lie within [-0.1, 1]
        ax1.set_xlim([-0.1, 1])
        # The (n_clusters+1)*10 is for inserting blank space between silhouette
        # plots of individual clusters, to demarcate them clearly.
        ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

        # Compute the silhouette scores for each sample
        sample_silhouette_values = silhouette_samples(X, cluster_labels)

        y_lower = 10
        for i in range(n_clusters):
            # Aggregate the silhouette scores for samples belonging to
            # cluster i, and sort them
            ith_cluster_silhouette_values = \
                sample_silhouette_values[cluster_labels == i]

            ith_cluster_silhouette_values.sort()

            size_cluster_i = ith_cluster_silhouette_values.shape[0]
            y_upper = y_lower + size_cluster_i

            color = cm.nipy_spectral(float(i) / n_clusters)
            ax1.fill_betweenx(np.arange(y_lower, y_upper),
                            0, ith_cluster_silhouette_values,
                            facecolor=color, edgecolor=color, alpha=0.7)

            # Label the silhouette plots with their cluster numbers at the middle
            ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

            # Compute the new y_lower for next plot
            y_lower = y_upper + 10  # 10 for the 0 samples

        ax1.set_title("The silhouette plot for the various clusters.")
        ax1.set_xlabel("The silhouette coefficient values")
        ax1.set_ylabel("Cluster label")

        # The vertical line for average silhouette score of all the values
        ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

        ax1.set_yticks([])  # Clear the yaxis labels / ticks
        ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

        # 2nd Plot showing the actual clusters formed
        colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
        ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                    c=colors, edgecolor='k')

        # Labeling the clusters
        centers = clusterer.cluster_centers_
        # Draw white circles at cluster centers
        ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                    c="white", alpha=1, s=200, edgecolor='k')

        for i, c in enumerate(centers):
            ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                        s=50, edgecolor='k')

        ax2.set_title("The visualization of the clustered data.")
        ax2.set_xlabel("Feature space for the 1st feature")
        ax2.set_ylabel("Feature space for the 2nd feature")

        plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                    "with n_clusters = %d" % n_clusters),
                    fontsize=14, fontweight='bold')

    plt.show()

#use first 3 principal components
range_n_clusters = range(2, 10)
X = PCA(n_components=3).fit_transform(features_df)
plot_kmeans_on_pca(X, range_n_clusters)
For n_clusters = 2 The average silhouette_score is : 0.2407215136579905
For n_clusters = 3 The average silhouette_score is : 0.23698273688238533
For n_clusters = 4 The average silhouette_score is : 0.251357852627854
For n_clusters = 5 The average silhouette_score is : 0.22004460454651523
For n_clusters = 6 The average silhouette_score is : 0.2371698021942785
For n_clusters = 7 The average silhouette_score is : 0.22292517315368554
For n_clusters = 8 The average silhouette_score is : 0.22662615111651338
For n_clusters = 9 The average silhouette_score is : 0.22770173176484657
[Figures: silhouette plots and cluster visualizations for n_clusters = 2 through 9, 3 PCA components]
In [20]:
#use first 5 principal components
X = PCA(n_components=5).fit_transform(features_df)
plot_kmeans_on_pca(X, range_n_clusters)
For n_clusters = 2 The average silhouette_score is : 0.15592105387582791
For n_clusters = 3 The average silhouette_score is : 0.14966897726272577
For n_clusters = 4 The average silhouette_score is : 0.156425817630668
For n_clusters = 5 The average silhouette_score is : 0.16138163306363063
For n_clusters = 6 The average silhouette_score is : 0.15799503197248857
For n_clusters = 7 The average silhouette_score is : 0.15510327298427973
For n_clusters = 8 The average silhouette_score is : 0.15208334645648836
For n_clusters = 9 The average silhouette_score is : 0.15665460480553653
[Figures: silhouette plots and cluster visualizations for n_clusters = 2 through 9, 5 PCA components]
In [21]:
#use first 7 principal components
X = PCA(n_components=7).fit_transform(features_df)
plot_kmeans_on_pca(X, range_n_clusters)
For n_clusters = 2 The average silhouette_score is : 0.11639464234039845
For n_clusters = 3 The average silhouette_score is : 0.12007479814750582
For n_clusters = 4 The average silhouette_score is : 0.13275593451270487
For n_clusters = 5 The average silhouette_score is : 0.13367103763742733
For n_clusters = 6 The average silhouette_score is : 0.1202784571421074
For n_clusters = 7 The average silhouette_score is : 0.12186138230209836
For n_clusters = 8 The average silhouette_score is : 0.12310147157357862
For n_clusters = 9 The average silhouette_score is : 0.1215850616589286
[Figures: silhouette plots and cluster visualizations for n_clusters = 2 through 9, 7 PCA components]

Results with PCA + KMeans¶

  • Comparing the average silhouette scores, we observe that increasing the number of PCA components for KMeans actually harms performance considerably. The optimal number of PCA components is 3. Regardless, the low silhouette scores suggest that KMeans might not be the optimal choice.
  • Applying KMeans to the 3-component PCA features, the silhouette scores suggest that the optimal number of clusters is 4. Arguably, 3 clusters could also make sense, given that it separates the data more cleanly.
  • We therefore conclude that the optimal number of clusters for KMeans is 4. Since a 2D visualization (using only the first two PCA components) cannot capture variation that lies on a higher-dimensional manifold, the silhouette score better captures how concentrated each cluster is internally and how well it is separated from the others.

Agglomerative Clustering¶

We now perform agglomerative clustering, which builds a hierarchical clustering by iteratively merging the most similar data points. This type of clustering is more adaptive and relies on fewer assumptions than KMeans.
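The merge hierarchy can also be visualized with a dendrogram (the `dendrogram` import at the top of the notebook is otherwise unused). A sketch on a small synthetic sample, since a dendrogram over the full feature matrix would be illegible:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(42)
X = rng.normal(size=(30, 3))  # small synthetic sample

# 'average' matches the linkage used in the cells below
Z = linkage(X, method='average')
plt.figure(figsize=(10, 4))
dendrogram(Z)
plt.title('Average-linkage dendrogram (synthetic sample)')
plt.xlabel('Sample index')
plt.ylabel('Merge distance')
plt.show()
```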

In [ ]:
fig, axes = plt.subplots(5, 2, figsize=(20, 40))
X = PCA(n_components=10).fit_transform(features_df)
for cluster_num in range(2, 12):
    i,j = (cluster_num - 2) // 2, (cluster_num - 2) % 2
    #perform agglomerative clustering on the PCA data
    cluster_labels = AgglomerativeClustering(
        n_clusters=cluster_num, linkage='average').fit_predict(X)
    #output the silhouette score to evaluate clustering quality
    silhouette_avg = silhouette_score(X, cluster_labels)
    axes[i,j].scatter(X[:,0], X[:,1], c=cluster_labels)
    axes[i,j].set_title(f'{cluster_num} clusters')
    print(f'For {cluster_num} clusters, the average silhouette_score is: {silhouette_avg}')
For 2 clusters, the average silhouette_score is: 0.3859069175834477
For 3 clusters, the average silhouette_score is: 0.3577723981351766
For 4 clusters, the average silhouette_score is: 0.29328914391836314
For 5 clusters, the average silhouette_score is: 0.29140454937828475
For 6 clusters, the average silhouette_score is: 0.24558293640073994
For 7 clusters, the average silhouette_score is: 0.2452175148625528
For 8 clusters, the average silhouette_score is: 0.21045756819178682
For 9 clusters, the average silhouette_score is: 0.21021208815497672
For 10 clusters, the average silhouette_score is: 0.17581173152773555
For 11 clusters, the average silhouette_score is: 0.17247068027993778
[Figures: agglomerative clustering scatter plots for 2 through 11 clusters, 10 PCA components]
In [27]:
#same figure as above but with different number of clusters
fig, axes = plt.subplots(5, 2, figsize=(20, 40))
X = PCA(n_components=5).fit_transform(features_df)
for cluster_num in range(2, 12):
    i,j = (cluster_num - 2) // 2, (cluster_num - 2) % 2
    cluster_labels = AgglomerativeClustering(
        n_clusters=cluster_num, linkage='average').fit_predict(X)
    silhouette_avg = silhouette_score(X, cluster_labels)
    axes[i,j].scatter(X[:,0], X[:,1], c=cluster_labels)
    axes[i,j].set_title(f'{cluster_num} clusters')
    print(f'For {cluster_num} clusters, the average silhouette_score is: {silhouette_avg}')
For 2 clusters, the average silhouette_score is: 0.4380196805494214
For 3 clusters, the average silhouette_score is: 0.2953958156906736
For 4 clusters, the average silhouette_score is: 0.26348655335428267
For 5 clusters, the average silhouette_score is: 0.17919400592690682
For 6 clusters, the average silhouette_score is: 0.13344845991725302
For 7 clusters, the average silhouette_score is: 0.13289440471072397
For 8 clusters, the average silhouette_score is: 0.10324328871075238
For 9 clusters, the average silhouette_score is: 0.05963217638982547
For 10 clusters, the average silhouette_score is: 0.05914563262152535
For 11 clusters, the average silhouette_score is: 0.051686644796053886
[Figures: agglomerative clustering scatter plots for 2 through 11 clusters, 5 PCA components]
In [42]:
fig, axes = plt.subplots(4, 2, figsize=(20, 40))
agg_df = features_df.copy()
agg_df['review_stars'] = filtered_df['review_stars']
X = PCA(n_components=3).fit_transform(features_df)

for cluster_num in range(2, 10):
    i,j = (cluster_num - 2) // 2, (cluster_num - 2) % 2
    cluster_labels = AgglomerativeClustering(
        n_clusters=cluster_num, linkage='average').fit_predict(X)
    agg_df[f'cluster_{cluster_num}'] = cluster_labels
    silhouette_avg = silhouette_score(X, cluster_labels)
    axes[i,j].scatter(X[:,0], X[:,1], c=cluster_labels)
    axes[i,j].set_title(f'{cluster_num} clusters')
    print(f'For {cluster_num} clusters, the average silhouette_score is: {silhouette_avg}')
For 2 clusters, the average silhouette_score is: 0.5176685576133001
For 3 clusters, the average silhouette_score is: 0.5083609340146049
For 4 clusters, the average silhouette_score is: 0.38915342785638585
For 5 clusters, the average silhouette_score is: 0.24581118082182937
For 6 clusters, the average silhouette_score is: 0.16728503970849987
For 7 clusters, the average silhouette_score is: 0.16480647142682803
For 8 clusters, the average silhouette_score is: 0.1206762754372233
For 9 clusters, the average silhouette_score is: 0.11960952215844287
[Figures: agglomerative clustering scatter plots for 2 through 9 clusters, 3 PCA components]

Discussions to Agglomerative Clustering¶

Agglomerative Clustering Results

  • Evaluating the silhouette scores, we observe that using 3 PCA components yields the best results; within that setting, 2 or 3 clusters perform best.
  • We also observe that increasing the number of clusters greatly degrades clustering performance, given its tendency to overfit.

Agglomerative Clustering vs. KMeans

  • Compared to KMeans, agglomerative clustering is significantly more robust to the choice of PCA data matrix, with less fluctuation in performance.
  • Agglomerative clustering is better justified on the Yelp datasets because the distribution of reviews on the data manifold is usually non-uniform across clusters. It forms clusters based on similarity between data points and the constructed internal nodes. KMeans, on the other hand, assumes spherical clusters, equal cluster sizes, and similar within-cluster variability, so it is sensitive to noise or additional features (e.g., more PCA components) that violate these assumptions. It is also sensitive to the initialization of the centroids.
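The assumption gap can be illustrated on synthetic anisotropic (sheared) blobs, in the style of the classic scikit-learn example; neither the data nor the comparison below comes from the Yelp features:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Shear the blobs so clusters become elongated (non-spherical)
X, y = make_blobs(n_samples=400, centers=3, random_state=170)
X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
ac = AgglomerativeClustering(n_clusters=3, linkage='average').fit_predict(X)

# Adjusted Rand index against the true labels (1.0 = perfect recovery)
print('KMeans ARI:', adjusted_rand_score(y, km))
print('Agglomerative ARI:', adjusted_rand_score(y, ac))
```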
In [ ]:
#Evaluate clusters against review_stars(cluster = 2)
per_cluster_rating = agg_df.groupby('cluster_2')['review_stars'].mean()
per_cluster_rating.plot(kind='bar')
plt.title('Average Review Stars per Cluster')
plt.ylabel('Average Review Stars')
plt.xlabel('Cluster Number')
Out[ ]:
Text(0.5, 0, 'Cluster Number')
[Figure: average review stars per cluster (2 clusters)]
In [40]:
features = [f for f in agg_df.columns if 'cluster' not in f]
#analyze the mean of each feature per cluster and what was embodied
cluster_df = agg_df.groupby('cluster_2')[features].mean()
feature_std = agg_df[features].std()
top_5_features = feature_std.nlargest(5)
cluster_df.loc[:, top_5_features.index]
Out[40]:
review_stars review_count business_stars longitude year
cluster_2
0 3.786011 -0.003737 -0.000827 0.000103 0.000379
1 3.105263 5.530932 1.224356 -0.153185 -0.560773
In [ ]:
#Evaluate clusters against review_stars(cluster = 3)
per_cluster_rating = agg_df.groupby('cluster_3')['review_stars'].mean()
per_cluster_rating.plot(kind='bar')
plt.title('Average Review Stars per Cluster')
plt.ylabel('Average Review Stars')
plt.xlabel('Cluster Number')
Out[ ]:
Text(0.5, 0, 'Cluster Number')
[Figure: average review stars per cluster (3 clusters)]
In [ ]:
features = [f for f in agg_df.columns if 'cluster' not in f]
#analyze the mean of each feature per cluster and what was embodied
cluster_df = agg_df.groupby('cluster_3')[features].mean()
#select features that have the highest variance(most distinguishing)
feature_std = agg_df[features].std()
top_5_features = feature_std.nlargest(5)
cluster_df.loc[:, top_5_features.index]
Out[ ]:
review_stars review_count business_stars longitude year
cluster_3
0 2.375000 0.017162 -0.173367 -0.044754 -1.137423
1 3.105263 5.530932 1.224356 -0.153185 -0.560773
2 3.798974 -0.003929 0.000758 0.000516 0.010831

Discussion of Clustering Results With Respect to Features¶

  • The PCA loading matrix suggests that a restaurant's cuisine has limited impact on the other review- and restaurant-related features.
  • Both clustering methods suggest that the restaurant reviews can be grouped into 2-3 clusters. An excessive number of clusters easily overfits and introduces noise.
  • We analyze what each review cluster represents by examining the mean features of each cluster. The clusters are discriminative across review_count, business_stars, longitude, and year. Moreover, the clustering correlates with the target of interest, the review_stars variable, suggesting that it is effective and corresponds to rating segments. Based on the clustered data frame, we also observe that year is positively correlated with review_stars.
  • Through clustering, we successfully analyzed review trends, which could give restaurants insights into ways to improve their ratings as well as into customer inclinations.